Support selecting different hash functions in hash_partition #6726
Can one of the admins verify this patch?
Would be good to add a unit test with identity hash and unsupported data type, otherwise LGTM
There is no test yet for the fatal assertion raised when the hash function is not compatible with the data type, for example when the identity hash function is used on a string column. That test needs to wait until #6696 is merged.
We should not rely on `release_assert` for communicating errors to the user. `release_assert` is a last resort, as it is an unrecoverable error that requires restarting the process.
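One way to act on this point is to validate the column types on the host, before any kernel is launched, and report a mismatch through an ordinary exception. The following is a minimal sketch of that idea only. The `type_id` and `hash_id` enums, `is_arithmetic`, and `validate_hash_support` are hypothetical stand-ins written for this illustration, not libcudf's actual types or API.

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

// Hypothetical stand-ins for illustration; not libcudf's real enums.
enum class type_id { INT32, FLOAT64, STRING };
enum class hash_id { HASH_IDENTITY, HASH_MURMUR3 };

// In this sketch, identity hashing is only defined for arithmetic keys.
bool is_arithmetic(type_id t) { return t == type_id::INT32 || t == type_id::FLOAT64; }

// Runs on the host before any device work starts. A bad combination is
// reported as a recoverable exception instead of a device-side assertion.
void validate_hash_support(std::vector<type_id> const& column_types, hash_id hash)
{
  if (hash != hash_id::HASH_IDENTITY) { return; }  // the general hash accepts every type here
  for (auto t : column_types) {
    if (!is_arithmetic(t)) {
      throw std::invalid_argument("identity hash does not support this data type");
    }
  }
}
```

With a check like this at the top of the host function, the caller can catch the exception and continue, and the device-side assertion stays as a backstop that should never fire.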
Any updates on reviewing this PR?
@gaohao95 Can you please resolve the merge conflicts? Then we are ready to merge it.
…elect-hash-partition
The conflicts should be addressed.
@@ -775,11 +777,25 @@ std::pair<std::unique_ptr<table>, std::vector<size_type>> hash_partition(
  table_view const& input,
Why not make this function templated? Then you can remove the runtime failures and the switch statement below, right? That would let errors show up at compile time instead of run time, which is convenient for developers.
So for the caller, it would be:
hash_partition<IdentityHash>(input, columns_to_hash, num_partitions, stream, mr);
instead of:
hash_partition(input, columns_to_hash, num_partitions, hash_id::HASH_IDENTITY, stream, mr);
Also, you have written your data type check twice: once here and once above with `std::enable_if_t`. That is repeating yourself. With this change, you would only need to do it once, and this function would stay the same length instead of growing by 12 lines.
It cannot be a template because most libcudf users are dynamic/interpreted languages like Python/Spark where the relevant information isn't known until runtime. Making the hash function a template parameter would just force the caller to do the switch statement.
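The trade-off described here can be sketched as follows: the library keeps the templated implementation internally and exposes a runtime enum, so the switch is written once inside the library rather than by every caller. This is an illustrative sketch only. `partition_of`, `partition_of_impl`, and both hasher structs are hypothetical names, and `mixing_hash` is a murmur-style finalizer used as a stand-in, not the real `MurmurHash3_32`.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

enum class hash_id { HASH_IDENTITY, HASH_MURMUR3 };  // hypothetical enum for this sketch

struct identity_hash {
  std::uint32_t operator()(std::uint32_t key) const { return key; }
};

// Stand-in for the general-purpose hash; NOT the real MurmurHash3_32.
struct mixing_hash {
  std::uint32_t operator()(std::uint32_t key) const
  {
    key ^= key >> 16;
    key *= 0x85ebca6bu;
    key ^= key >> 13;
    key *= 0xc2b2ae35u;
    key ^= key >> 16;
    return key;
  }
};

// Compile-time flavor: the hash function is a template parameter.
template <typename Hasher>
std::uint32_t partition_of_impl(std::uint32_t key, std::uint32_t num_partitions)
{
  return Hasher{}(key) % num_partitions;
}

// Runtime flavor: callers that only learn the hash choice at run time
// (e.g. bindings for Python or Spark) pass an enum, and the switch lives here.
std::uint32_t partition_of(std::uint32_t key, std::uint32_t num_partitions, hash_id hash)
{
  switch (hash) {
    case hash_id::HASH_IDENTITY: return partition_of_impl<identity_hash>(key, num_partitions);
    case hash_id::HASH_MURMUR3: return partition_of_impl<mixing_hash>(key, num_partitions);
  }
  throw std::invalid_argument("unsupported hash function");
}
```

If only the templated form were public, each binding layer would have to repeat the same switch on its own side.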
Okay!
template <typename return_type = result_type>
CUDA_HOST_DEVICE_CALLABLE std::enable_if_t<!std::is_arithmetic<Key>::value, return_type>
operator()(const Key& key) const
{
  release_assert(false && "IdentityHash does not support this data type");
  return 0;
}
Why is this necessary? I think that this code turns a compile time error into a runtime error, which is not good for developers because it will cause them to find their coding errors later.
If you remove this, does the code still compile? If so then this code is never generating anything so it can be removed.
This function object is invoked via the `type_dispatcher`, which will instantiate it for all possible libcudf types. We need to provide a valid instantiation for all types. This includes types that should never actually be invoked (as seen above).
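The mechanism can be illustrated with a much smaller dispatcher than libcudf's. In the sketch below, `dispatch`, `type_id`, and `identity_hash_fn` are hypothetical names made up for this example. It is host-only code, so the unsupported overload throws instead of calling `release_assert` as the device code in this PR does.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <type_traits>

enum class type_id { INT32, FLOAT64, STRING };  // hypothetical, for illustration

// Minimal dispatcher: every case instantiates f.operator()<T>, so the functor
// must compile for each listed type, even one that is never reached at run time.
template <typename Functor>
std::uint32_t dispatch(type_id id, Functor f)
{
  switch (id) {
    case type_id::INT32: return f.template operator()<std::int32_t>();
    case type_id::FLOAT64: return f.template operator()<double>();
    case type_id::STRING: return f.template operator()<std::string>();
  }
  throw std::logic_error("unknown type_id");
}

struct identity_hash_fn {
  void const* element;  // type-erased pointer to the key

  // Selected for arithmetic types: the value itself is the hash.
  template <typename T, std::enable_if_t<std::is_arithmetic<T>::value, int> = 0>
  std::uint32_t operator()() const
  {
    return static_cast<std::uint32_t>(*static_cast<T const*>(element));
  }

  // Selected for everything else. Without this overload, the STRING case in
  // dispatch() would have no viable call and the whole program would not compile.
  template <typename T, std::enable_if_t<!std::is_arithmetic<T>::value, int> = 0>
  std::uint32_t operator()() const
  {
    throw std::invalid_argument("identity hash does not support this data type");
  }
};
```

Deleting the second overload makes the `STRING` case ill-formed, which is the compile failure discussed in this thread.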
Okay, so if these lines of code are removed then cudf will no longer compile successfully?
For this case, yes. A `static_cast` from a string to an integer should fail at compile time.
@jrhemstad Do you have any other suggestions for this PR? Can you approve it?
ok to test
Please update the changelog in order to start CI tests. View the gpuCI docs here.
add to allowlist
…elect-hash-partition
Jake is on vacation and his review has been addressed.
@gaohao95 please merge the latest from branch-0.17 and then add the missing include in partitioning.hpp.
…elect-hash-partition
Codecov Report
@@ Coverage Diff @@
## branch-0.17 #6726 +/- ##
============================================
Coverage 81.94% 81.94%
============================================
Files 96 96
Lines 16166 16166
============================================
Hits 13247 13247
Misses 2919 2919
This PR is to allow hash partitioning to configure the seed of its hash function. As noted in #6307, using the same hash function in hash partitioning and join leads to a massive hash collision and severely degrades join performance on multiple GPUs. There was an initial fix (#6726) to this problem, but it added only the code path to use identity hash function in hash partitioning, which doesn't support complex data types and thus cannot be used in general. In fact, using the same general Murmur3 hash function with different seeds in hash partitioning and join turned out to be a sufficient fix. This PR is to enable such configurations by making `hash_partition` accept an optional seed value.
Authors:
- Wonchan Lee (https://github.com/magnatelee)
Approvers:
- https://github.com/gaohao95
- Mark Harris (https://github.com/harrism)
- https://github.com/nvdbaranec
- Jake Hemstad (https://github.com/jrhemstad)
URL: #7771
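The collision effect described in that commit message can be reproduced in a small host-side model. The sketch below is an illustration under simplifying assumptions: `seeded_hash` is the SplitMix64 finalizer with the seed XORed into the key, used as a stand-in for a seeded Murmur3. The "join" is modeled as a table with 16 buckets indexed by `hash % 16`, and `buckets_used` is a hypothetical helper. None of this is libcudf code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

// Seeded 64-bit mixer (SplitMix64 finalizer); a stand-in for seeded Murmur3.
std::uint64_t seeded_hash(std::uint64_t key, std::uint64_t seed)
{
  std::uint64_t z = key ^ seed;
  z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
  z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
  return z ^ (z >> 31);
}

// Counts how many of the 16 join buckets are touched by the keys that
// hash partitioning routes to partition 0 (out of 4 partitions).
std::size_t buckets_used(std::uint64_t partition_seed, std::uint64_t join_seed)
{
  constexpr std::uint64_t num_keys = 10000, num_partitions = 4, num_buckets = 16;
  std::set<std::uint64_t> used;
  for (std::uint64_t key = 0; key < num_keys; ++key) {
    if (seeded_hash(key, partition_seed) % num_partitions != 0) { continue; }  // other partition
    used.insert(seeded_hash(key, join_seed) % num_buckets);
  }
  return used.size();
}
```

With the same seed on both sides, every key in partition 0 has a hash that is a multiple of 4, so it can only land in buckets 0, 4, 8, or 12, leaving at least three quarters of the table unused. With a different join seed the bucket index is no longer tied to the partition index, so the keys spread over the table.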
This PR intends to
- allow `hash_partition` to select a different hash function (e.g. the identity hash function) in addition to `MurmurHash3_32`. (Closes [FEA] Support selecting different hash functions in hash_partition #6307)
- `hash_partition` implementation in `src/hash/hashing.cu`.
Restrictions: